OPEN 3 ACCESS Freely available online 

Identifying Keystone Species in the Human Gut (Bn 
Microbiome from Metagenomic Timeseries Using Sparse crossMark 
Linear Regression 

Charles K. Fisher, Pankaj Mehta* 

Department of Physics, Boston University, Boston, Massachusetts, United States of America 

Abstract 

Human associated microbial communities exert tremendous influence over human health and disease. With modern 
metagenomic sequencing methods it is now possible to follow the relative abundance of microbes in a community over 
time. These microbial communities exhibit rich ecological dynamics and an important goal of microbial ecology is to infer 
the ecological interactions between species directly from sequence data. Any algorithm for inferring ecological interactions 
must overcome three major obstacles: 1) a correlation between the abundances of two species does not imply that those 
species are interacting, 2) the sum constraint on the relative abundances obtained from metagenomic studies makes it 
difficult to infer the parameters in timeseries models, and 3) errors due to experimental uncertainty, or mis-assignment of 
sequencing reads into operational taxonomic units, bias inferences of species interactions due to a statistical problem called 
"errors-in-variables". Here we introduce an approach. Learning Interactions from Microbial Time Series (LIMITS), that 
overcomes these obstacles. LIMITS uses sparse linear regression with boostrap aggregation to infer a discrete-time Lotka- 
Volterra model for microbial dynamics. We tested LIMITS on synthetic data and showed that it could reliably infer the 
topology of the inter-species ecological interactions. We then used LIMITS to characterize the species interactions in the gut 
microbiomes of two individuals and found that the interaction networks varied significantly between individuals. 
Furthermore, we found that the interaction networks of the two individuals are dominated by distinct "keystone species", 
Bacteroides fragilis and Bacteroided stercosis, that have a disproportionate influence on the structure of the gut microbiome 
even though they are only found in moderate abundance. Based on our results, we hypothesize that the abundances of 
certain keystone species may be responsible for individuality in the human gut microbiome. 
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Introduction 

Metagenomic sequencing technologies have revolutionized the 
study of the human-associated microbial consortia making up the 
human microbiome. Sequencing methods now allow researchers 
to estimate the relative abundance of the species in a community 
without having to culture individual species [1-3]. These studies 
have shown that microbial cells vasdy outnumber human cells in 
the body, and that symbiotic microbial communities are important 
contributors to human health [1]. For example, a recent study by 
Ridaura et al [4] demonstrated that transplants of gut microbial 
consortia are sufficient to induce obesity in previously lean mice or 
to promote weight loss in previously obese mice, suggesting an 
intriguing hypothesis that the composition of the gut microbiome 
may also contribute to obesity in humans. Many other studies have 
found significant links between the composition of human- 
associated microbial consortia and diseases including cancer and 
austim spectrum disorder [4—10]. Despite the recent revelations 
highlighting the importance of the microbiome to human health, 
relatively little is known about the ecological structure and 
dynamics of these microbial communities. 



A microbial community consists of a vast number of species, all 
of which must compete for space and resources. In addition to 
competition, there are also many symbiotic interactions where 
certain species benefit from the presence of other microbial 
species. For example, a small molecule that is secreted by one 
species can be metabolized by another [11]. These species 
interactions provide a window with which to view the ecology of 
a microbial community, and allow one to make predictions about 
the effect of perturbations on a population [12]. For example, 
removing a species that engages in mutualistic interactions may 
diminish the abundance of other species that depend on it for 
survival. Given their utility for understanding the ecology of a 
community, there is tremendous interest in developing techniques 
to infer interactions between species from metagenomic data [12- 
14]. 

There are two approaches to inferring dependencies between 
microbial species from metagenomic studies: cross-sectional 
analysis, and timeseries analysis [12-16]. Cross-sectional studies 
pool samples of the relative abundances of the microbial species in 
a particular environment (e.g. the gut) from multiple individuals 
and utilize correlations in the relative abundances as proxies for 
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effective interactions between species. By contrast, timeseries 
analysis follows the relative abundances of the microbial species in 
a particular environment, for a single individual, over time and 
utilizes dynamical modeling (e.g. ordinary differential equations) to 
understand dependencies between species. 

Any methods for making reliable inferrences about species 
interactions from metagenomic studies must overcome three 
major obstacles. First, as shown below, a correlation between the 
abundances of two species does not imply that those species are 
interacting. Second, metagenomic methods measure the relative, 
not absolute, abundances of the microbial species in a community. 
This makes it difiicult to infer the parameters in timeseries models. 
Finally, errors due to experimental measurement errors and/or 
mis-assignment of sequencing reads into operational taxonomic 
units (OTUs), bias inferences of species interactions due to a 
statistical problem called "errors-in- variables" [17]. We will show 
that each of these obstacles can be overcome using a new method 
we call LIMITS (Learning Interactions from Microbial Time 
Series). LIMITS obtains a reUable estimate for the topology of the 
directed species interaction network by employing sparse linear 
regression with bootstrap aggregation ("Bagging") to learn the 
species interactions in a discrete-time Lotka-Volterra (dLV) model 
of population dynamics from a time series of relative species 
abundances [18,19]. 

Results 

Correlation does not imply interaction 

Many previous works use the correlation between the relative 
abundances of two microbial species in an environment (e.g. the 
gut) as a proxy for how much the species interact. In particular, a 
high degree of correlation between the abundances of two species 
is often taken as a proxy for a strong mutualistic interaction, and 
large anti-correlations, as indicative of a strong competitive 
interactions. Using correlations as a proxy for interactions suffers 
from several drawbacks. First, there are important subdeties 
involved in calculating correlations between species from relative 
abundances, but previous studies have presented algorithms (e.g. 
SparCC) to mitigate these problems [14]. More importandy, the 
abundances of two species may be correlated even if those species 
do not direcdy interact. For example, if species A directly interacts 
with species B, and species B directly interacts with species G, the 
abundances of species A and C are likely to be correlated even 
though they do not direcdy interact. Finally, since correlation 
matrices are necessarily symmetric, all interactions learned using 
correlations must also be symmetric. 

The problems with using correlations in species abundances as 
proxies for species interactions can be illustrated with a simple 
numerical simulation. We used the dLV model (Eq. 2) to simulate 
timeseries of the absolute abundances of 10 species for 1000 
timesteps, starting from 100 different initial conditions, for two 
arbitrary species interaction matrices (see Materials and Methods). 
See Figure SI for example time series. Figure 1 compares the 
Pearson correlation matrices calculated from the absolute species 
abundances obtained from the dLV simulations and the true 
interaction matrices. It is clear from Fig. 1 that there is no obvious 
relationship between the correlations and the species interactions. 
That is, correlations in species abundances are actually very poor 
proxies for species interactions. 

In general, the relationship between the interaction coefficients 
{Cij) and the correlations in the species abundances is described by 
a complicated non-linear function that is difficult to compute or 
utilize. This can be seen by linearizing the dynamics of lnx,(f) (Eq. 
4) around their equilibrium values, ln<(xi(?)). This stochastic 



process is a first-order autoregessive model described by 

\nx{t+\) = (o+J\nx{t) + [,{t), (1) 

where m, = — ^^^.c,y<(x/>, and Jij = fiij + cy(Xjy is the Jacobian 
obtained by linearizing around the equilibrium species abundanc- 
es, and ({t) = In t](t). If V is the covariance matrix with elements 
Fi, = <(lnx,(0-<lnx,(0»(lnx,(0-<lnx,(0»> and I is tiie 
covariance matrix of the Gaussian noise C(0) then 
vec(F) = (/^2 — ^®/) 'vec(S), where ® is the Kronecker 
product and vec is the matrix vectorization operator [20]. Thus, 
the interaction matrix is related to the covariance matrix by 
complex relation even in the linear regime of the dynamics. For 
this reason, it is very difficult, ff not impossible, to determine the 
interaction txjefficients using only knowledge of the correlations 
and equilibrium abundances. 

Cross-sectional studies that pool data across individuals and 
utihze the correlations between the abundances of different species 
as proxies for species dependencies are especially affected by this 
problem. This suggests that time-series data is likely to be more 
suited for inferring ecological interactions than cross-sectional 
data. 

Timeseries inferrence with relative species abundances 

Even though the species interaction coefficients cannot be 
inferred from the correlations in species abundances, it is possible 
to reliably infer the interaction matrix using timeseries models. To 
do so, one utilizes a discrete time Lotka-Volterra Model (dLV) that 
relates the abundance of species / at a time t+ \ (Xi(t+ 1)) to the 
abundances of all the species in the ecosystem at a time / 
{x = {xi(t), . . . ,Xf/(t)}). These interactions are encoded in the 
dLV through a set of interaction cxK-ffK ic-nts, c,y, that describe the 
influence species J has on the abundance of species i [19], and 
inferring these interaction coefficients is the major goal of this 
work. The effect of species j on species / can be beneficial {cy > 0), 
competitive (c,y<0), or the two species may not interact (Cy=0). 

As shown in the Materials and Methods, given a time-series of 
the absolute abundances of the microbes in an ecosystem, one can 
learn the interaction coefficients by performing a linear regression 
of lnx,(7+ 1)— lnx,(7) against "(x), where <x> is the vector 

of the equilibrium abundances. It is important to note that each of 
these hnear regressions can be performed independently for 
species i=l,...,N. In the following, we assume that the 
population dynamics are stable and that the equilibrium <[x> (or 
<x)) can be estimated by taking the median species abundances 
over the time series. 

Recall that most modern metagenomic technirjues can only 
measure the relative abundances of microbes, not absolute 
abundances. This introduces additional complications into the 
problem of inferring species interactions using timeseries data. 
Although it is straight forward to infer species interactions by 
applying linear regression to a timeseries of absolute abundances, 
it is not a priori clear that linear regression still works when applied 
to a timeseries of the relative abundances. An important technical 
problem that arises when using relative abundances is that the 
design matrix for the regression is singular because of the sum 

constraint on relative abundances of species (^^.x,(r) = 1). As a 
result, there is no unique solution to the ordinary least squares 
problem applied to timeseries of relative species abundances. 

Nevertheless, the design matrix can be made to be invertible if 
one, or more, of the species are not included as variables in the 
regression. 
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Figure 1. There is no simple relation between interaction coefficients and correlations in abundance, a) A symmetric interaction matrix 
and the corresponding correlation matrix, b) There is no relation between the interaction parameters and the correlations in abundance for the 
symmetric interaction matrix, c) An asymmetric interaction matrix and the corresponding correlation matrix, d) There is no relation between the 
interaction parameters and the correlations in abundance for the asymmetric interaction matrix. Points from above the diagonal in the interaction 
matrix are gray circles, whereas points from below the diagonal are black squares. In a and c, matrix elements have been scaled so that the smallest 
negative element is — 1, the largest positive element is + 1, and all elements retain their sign. In b and d, interaction coefficents were scaled so that 
the largest element by absolute value has = 
doi:1 0.1 371 /journal.pone.01 02451 .gOOl 



This insight motivates tlie use of a forward stepwise regression 
for selecting the covariate species that explain the changes in 
abundance of species /. In such a procedure, interactions and 
species are added sequentially to the regression as long as they 
improve the predictive power of model (see Fig. 2a). Since the 
design matrix now only contains a sub-set of all possible species, it 
is never singular and the linear regression problem is weU-defmed. 
Furthermore, the goal of forward selection is to include only the 
strongest, most important species interactions in the model. 
Therefore, the resulting interaction networks are sparse and, 
hence, easily interpretable. 

The procedure for forward stepwise regression is illustrated in 
Fig. 2a. We know that each species must interact with itself, so c„ is 
the only interaction coefficient allowed to be nonzero in the fu-st 
iteration. In each subsequent iteration, one additional interaction 
Cjj is included in the model by scanning over all other species and 
choosing the one that produces the lowest error at predicting a test 
dataset. This is repeated as long as the prediction error decreases 
by a pre-specifed percentage that controls the sparsity of the 
model. A larger required improvement in prediction results in 
more sparse solutions. Note that the threshold defining the 
required improvement in prediction is a relative improvement of 
one model relative to another (expressed as a percent), not a 
measure of the absolute error in out of sample prediction. 

Forward stepwise regression is a greedy algorithm, which results 
in a well-known instability [18]. This instability can be 'cured' 
using a method called bootstrap aggregation, or "Bagging" 



(Fig. 2b,c) [18]. To bag forward stepwise regression, the data are 
randomly partitioned into a training set used for the regression and 
a test set used for evaluating the prediction error. The required 
improvement in prediction is a percentage that refers to how much 
the mean squared error evaluated on the test dataset must 
decrease in order to include an additional variable in the 
regression. The random partitioning of the data into training 
and test sets is repeated many times, each one resulting in a 
different estimate for the interaction matrbc. The classical 
approach to Bagging calls for averaging these different estimates, 
but this destroys the sparsity of the solution. For this reason, we use 
the median of the estimates instead of averages. This stiU greatly 
improves the stability of the inferred interaction matrix but 
preserves its sparsity. We call our algorithm LIMITS (Learning 
Interactions from Microbial Time Series). 

Figure 3 presents the results from applying LIMITS to infer the 
same interaction matrices discussed in Figure 1. The data consist 
of either absolute or relative abundances from timeseries with 500 
timesteps and 10 different initial conditions. The sample sizes are 
quite large, but not as large as used for the calculation of the 
correlations. The inferred parameters match the true interaction 
coefficients very accurately - the smallest between the inferred 
and true parameters is 0.82 - for both the symmetric (Fig. 3a-b) 
and asymmetric (Fig. 3c-d) interaction matrices using either 
absolute or relative species abundances. 

To ensure that the exceptional performance of our sparse linear 
regression approach to inferring species interactions was not a 
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Figure 2. Schematic illustrating forward stepwise regression and median bootstrap aggregating, a) In forward stepwise regression, 
interactions are added to the model one at a time as long as including the additional covariate lowers the prediction error by a pre-defined threshold, 
b) The prediction error used for variable selection is evaluated by randomly partitioning the data into a training set used for the regression and a test 
used to evaluate the prediction error, c) Multiple models are built by repeatedly applying forward stepwise regression to random partitions of the 
data, each containing half the data points. The models are aggregated, or "bagged", by taking the median, which improves the stability of the fit 
while preserving the sparsity of the inferred interactions. 
doi:1 0.1 371/journal.pone.01 02451 .g002 



fluke due to a particular choice of interaction matrices, we 
calculated the correlation between the true and inferred param- 
eters for many randomly generated interaction matrices (see Fig. 4 



and Materials and Methods). Note that the average magnetidue of 
c,y relative to the noise in these simulations is fixed, and that these 
simulations have no measurement errors (see next section for more 



a Symmetric Interaction Matrix Inferred from Abundances Inferred from Relative Abundances [j 




C Asymmetric Interaction Matrix Inferred from Abundances Inferred from Relative Abundances d Normalized Inferred Interactions 




Normalized Inferred Interactions 



Figure 3. Example fits of interaction parameters using sparse linear regression, a) A symmetric interaction matrix (left), the corresponding 
matrix inferred from absolute abundance data (middle), and the corresponding matrix inferred from relative abundance data (right), b) There is good 
aggreement between the true and inferred interactions, from both absolute (black) and relative (gray) abundances, for the symmetric interaction 
matrix, c) An asymmetric interaction matrix (left), the corresponding matrix inferred from absolute abundance data (middle), and the corresponding 
matrix inferred from relative abundance data (right), d) There is good aggreement between the true and inferred interactions, from both absolute 
(black) and relative (gray) abundances, for the asymmetric interaction matrix. The prediction error threshold was set to 5% in for all fits. 
doi:1 0.1 371/journal.pone.01 02451. g003 
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on this). The performance of the algorithm obviously depends on 
sample size, but LIMITS generally perform admirably at inferring 
ecological interactions (see Figs. 4a-b for symmetric and 
asymmetric matrices, respectively). Furthermore, Figs. 4c-d show 
that our results do not strongly depend on the required 
improvement in prediction over most reasonable choices of 
threshold (0-5%). These results demonstrate that in the absence 
of measurement noise, LIMITS can successfully learn the 
interaction parameters from both absolute and relative abundanc- 
es. 

Inferring interactions in the presence of measurement 
errors 

Up to this point, our analyses have ignored the impact of 
"measurement noise" on the inferred species interactions. There 
are two important sources of measurement noise in metagenomic 
data. The first source is experimental noise introduced by 
sequencing errors. The second, and perhaps larger source of 
noise, is the mis-classification of sequencing reads into operational 
taxonomic units (OTUs). Most metagenomic studies rely on the 
sequencing of 16S rRNA to estimate species composition and 
diversity in a community. These 16S sequences are binned into 



groups, or OTUs, that contain sequences with a predetermined 
degree of similarity. By comparing the sequences in an OTU to 
known sequences in an annotated database, it is often possible to 
assign OTUs to particular species or strains. In general, this is an 
extremely difficult bioinformatics problem [2 1] and is likely to be a 
significant source of measurement errors. Thus, any algorithm for 
inferring species interactions must be robust to measurement 
errors. 

At first glance, it is tempting to assume that measurement noise, 
which we assume is multiplicative, simply adds to the stochastic 
(ln)7,(?)) term that acts on the dependent variable 
(In A',(?+ 1)— lnA:,(?)) and, therefore, should have little impact 
on the inferred interactions (see Methods). However, as we discuss 
below, this is not the case since the also act as the 

independent variables in the regression. Standard regression 
techniques assume that the independent variables are known 
exactly, and violation of this assumption results in biased 
parameter estimates even for asymptotically large sample sizes 
[17]. For example, in the simplest case of a 1 -dimensional 
regression Y = a-\- pX the estimator /? is always less than the real 
P, i.e. i? = COV(X,F)/VAR(X)<iS with equality only if there is 
no measurement error on X. The bias induced by using noisy 
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Figure 4. Performance of sparse linear regression as a functon of sample size and the prediction error threshold, a) Performance on 
absolute (red) and relative (black) abundances as a function of sample size for symmetric interaction matrices, b) Performance on absolute (red) and 
relative (black) abundances as a function of sample size for asymmetric interaction matrices, c) Performance on absolute (red) and relative (black) 
abundances as a function of the out-of-bag error threshold for symmetric interaction matrices, d) Performance on absolute (red) and relative (black) 
abundances as a function of the out-of-bag error threshold for symmetric interaction matrices. Error bars correspond to + one standard deviation, 
and lines connect the means. 
doi:1 0.1 371/journal.pone.01 02451. g004 
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indepedent variables in regression is known as the "errors-in- 
variables" problem in the statistical literature [17]. Analysis of the 
errors-in-variables bias for multivariate regression is more 
complicated than the 1 -dimensional example; nevertheless, it can 
be stated quite generally that the interaction parameters inferred 
in the presence of signficant measurement errors will be incorrect. 
The most reliable method for mitigating the errors-in-variables 
bias is to measure additional data on some "instrumental 
variables" that provide information on the true values of the 
relative species abundances [17]. Unfortunately, in most cases we 
often do not have access to such additional data. 

Although the errors-in-variables bias cannot be eliminated, the 
topology of the interaction network can still be reliably inferred 
using our sparse linear regression approach even when the 
measurements of the relative species abundances are very noisy. 
Knowledge of which interactions are beneficial (Ci/>0), compet- 
itive (Cy < 0), or zero {cy = 0) defines the topology of the interaction 
network. In the following simulations, we focus only on whether a 
given interaction is zero (c/y = 0) or not [c/j jt 0) because we found 
that errors in the signs of the interactions were rare (Fig. S2). The 
accuracy of the interaction topologies inferred from noisy relative 
abundance data were assessed using simulations with randomly 
chosen (asymmetric) interaction matrices. We computed the 
specificity - the fraction of species pairs correctly identified as 
non-interacting - and the sensitivity - the fraction of species pairs 
correctly identified as interacting - of the inferred topologies 
(Fig. 5) [22]. Figure 5a shows that LIMITS produced specificities 
between 60% and 80% for different choices of the prediction error 
threshold, and that the performance was relatively insenstive to 
multiplicative measurement noise up to, and even beyond, 10%. 
By contrast. Fig. 5b shows that applying forward variable selection 
to the entire dataset, i.e. without Bagging, produced results that 
were very senstive to the choice of the prediction error threshold 
and even produced specificities as low as 0%. The sensitivities for 
detecting interacting species with (Fig. 5c) or without (Fig. 5d) 
Bagging are both quite good, ranging between 70% and P>0% for 
measurement noise up to 10%. These results demonstrate that 
both the forward stepwise regression and the median bootrap 
aggregation are crucial components of the LIMITS algorithm. 
Moreover, LIMITS reliably infers the topology of the species 
interaction network even when there are significant errors-in- 
variables. 

Keystone species in the human gut microbiome 

Emboldened by the success of our algorithm on synthetic data, 
we applied LIMITS to infer the species interactions in the gut 
microbiomes of two individuals. The data from Caporaso et al [3] 
were obtained from the MGRAST database [23] in a pre- 
processed form; i.e. as relative species abundances instead of raw 
sequencing data. These data consisted of approximately a half- 
year of daily sampled relative species abundances for individual (a) 
and a full year of daily sampled relative species abundances for 
individual (b). Note that an interaction matrix has of order free 
parameters for n species. Also, Figures 4 and S2 show that the 
performance of LIMITS plateaus around rp' data points. Thus, in 
general, we require at least timepoints in order to infer the 
interactions from n species. The number of available timepoints 
was O(IOO), so we considered only the 10 most al)undant species 
from individuals (a) and (b). Because the most abundant species 
were not entirely the same in individuals (a) and (b), we studied 14 
species obtained by taking the union of the 10 most abundant 
species from individual (a) with the 10 most abundant species from 
individual (b). The 14 species are listed in Fig. 6. 



The species interaction network of the gut microbiome of 
individual (a) (shown in Fig. 6a) is dominated by the species 
Bacteroides fragilis even though it is found only in moderate 

abundances. Bacleroides fragilk has 6 outgoing interactions in 
individual (a), in contrast to the other 13 species that have 0-3 
outgoing interactions. The species interaction nc-t^vork of the gut 
microbiome of individual (b) (shown in Fig. 6b) is also dominated 
by a single species, Bacteroides stenosis, which is also found only in 
moderate abundaces. Bacteroides stercosis has 4 outgoing interac- 
tions in individual (b), in contrast to the other 1 3 species that have 
0—2 outgoing interactions. In addition, many of the interactions 
involving Bacteroides fragilis and Bacteroides stercosis are bene- 
ficial interactions. Based on these results, we refer to Bacteroides 
fragilis and Bacteroides stercosis as "keystone species" of the 
human gut microbiome because these two species exert tremen- 
dous influence on the structure of the microljial communitic's, vxvn 
though they have lower median abundances than some other 
species. 

Additionally, we observed that the species interaction topology 
of the gut microbiome of individual (a) differs substaintially from 
the species interaction topology of the gut microbiome of 
individual (b), as is clear from Fig. 6. In individual (a), Bacteroides 
fragilis is much more abundant than Bacteroides stercosis and, in 
turn, it is Bacteroides fragilis that dominates the interaction 
network of individual (a). Likewise, in individual (b), Bacteroides 
stercosis is much more abundant than Bacleroides fragilis and, in 
turn, it is Bacleroides stercosis that dominates the interaction 
network of individual (b). This observation motivates us to propose 
an intriguing hypothesis, that the abundances of certain keystone 
species are responsible for individuality of the human gut 
microbiome. Of course, much more data, from a larger 
population, will be required to confirm or reject this hypothesis. 

Discussion 

Metagenomic methods are providing an unprecedented window 
into the composition and structure of micriobial communities. 
They are revolutionizing our knowledge of microbial ecology and 
highlight the important roles played by the human microbiome in 
health and disease. Nevertheless, it is important to carefully 
consider the tools used to analyze these data and to address their 
associated challenges. We have highlighted three major obstacles 
that must be addressed by any study designed to use metagenomic 
data to analyze species interactions: 1) a correlation between the 
abundances of two species does not imply that those species are 
interacting, 2) the sum constraint on the relative abundances 
obtained from metagenomic studies makes it difficult to infer the 
parameters in timeseries models, and 3) errors due to experimental 
uncertainty, or mis-assignment of sequencing reads into opera- 
tional taxonomic units (OTUs), bias inferrences of species 
interactions due to a statistical problem called "errors-in- 
variables". 

To overcome these obstacles, we have introduced a novel 
algorithm, LIMITS, for inferring species interaction coefficients 
that combines sparse linear regression with bootstrap aggregation 
(Bagging). Our method provides reliable estimates for the topology 
of the species interaction network even when faced with significant 
measurement noise. The interaction networks constructed using 
our approach are sparse, including only the strongest ecological 
interactions. Regularizing the inference of the interaction network 
by favoring sparse solutions has the benefit that the results are 
easily interpretable, enabling the identification of keystone species 
with many important interactions. Furthermore, our work suggests 
that it is difficult to learn species interactions from cross-sectional 
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Figure 5. Sensitivity and specificity of predicted interactions as a function of measurement error for bagged and unbagged models. 

Specificity refers to the fraction of species pairs correctly identified as non-interacting, while sensitivity refers to the fraction of species pairs correctly 
identified as interaction. Both measures range from 0 (poor performance) to 1 (good performance), a) Specifity of sparse linear regression with 
Bagging as a function of measurement error for different prediction error thresholds, b) Specificity of sparse linear regression trained on the entire 
data set without Bagging as a function of measurement error for different prediction thresholds, c) Sensitivity of sparse linear regression with 
Bagging as a function of measurement error for different prediction error thresholds, d) Sensitivity of sparse linear regression trained on the entire 
data set without Bagging as a function of measurement error for different prediction error thresholds. Notice that without bagging, model 
performance is extremely sensitive to choice of the threshold for the required improvement in prediction for adding new interactions. 
doi:1 0.1 371 /journal.pone.01 02451 .g005 




Legend 
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1 Akkermansia muciniphlla 

2 Afistipes putredinis 

3. Bacteroides acidifaciens 

4. Bacteroides fragilis 
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Figure 6. Interaction topologies of abundant species in the guts of two individuals. The size of a node denotes the median relative species 
abundance, beneficial interactions are shown as solid red arrows, and competetive interactions are shown as dashed blue arrows. In individual a) 
species 4 Bacteroides fragilis acts as a keystone species with 6 outgoing interactions, compared to a median number of outgoing interactions of 1. In 
individual b) species 5 Bacteroides stercosis acts as a keystone species with 4 outgoing interactions, compared a median number of outgoing 
interactions of 1. The 14 species included in the model were obtained by taking the union of the top 10 most abundant species from individuals a 
and b. The required improvement in prediction was set to 3%, graphs obtained using other prediction thresholds are shown in the Supporting 
Information. 
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PLOS ONE I www.plosone.org 



7 



July 2014 I Volume 9 | Issue 7 | e102451 



Identifying Keystone Species in the Gut Microbiome 



studies that pool samples of the relative abundances of the 
microbial species from multiple individuals. This highlights the 
importance of collecting extended time-series data for understand- 
ing microbial ecological dynamics. 

We applied LIMITS to time-series data to infer ecological 
interaction networks of two individuals and found that the 
interaction networks are dominated by distinct keystone species. 
This motivated us to propose a hypothesis: that the abundances of 
certain keystone species are responsible for individuality of the 
human gut microbiome. While more data will be required to 
confirm or rej(;ct this hypothesis, it is intriguing to examine its 
potential consequences for the human microbiome. The keystone 
species hypothesis implies that even small perturbations to an 
environment can have a large impact on the composition of its 
resident microbial consortia if those perturbations affect a small 
number of important "keystone" species. Moreover, relatively 
small differences in individual diets, or minor differences in the 
interaction between the host immune system and the gut 
microbiota, that affect keystone species may be sufficient to 
organize gut microbial consortia into distinct types of communi- 
ties, or "enterotypes" [24,25]. 

Our analysis identified the closely related species Bacteroides 
fragilis and Bacteroides stercosis as potential keystone species of the 
gut microbiome [26]. Previous studies have suggested that the 
abundance of Bacteroides fragilis modulates the levels of several 
metabolites and, in turn, the composition of the gut microbiome in 
a mouse model of gastrointestinal abnormalities associated with 
autism spectrum disorder [9]. Abundance of both Bacteroides 
fragilis and Bacteroides stercosis are associated with an increased 
risk of colon cancer [5,6,10], and previous authors even suggest 
that Bacteroides fragilis acts as a critical "keystone pathogen" in 
the development of the disease [8] . Classical ecological models of 
species interaction demonstrate that the manner of the interaction 
between two species is not solely a function of their identity, but is 
highly dependent on the environment in which the interaction 
takes place [27,28]. Increased abundance ot Bacteroides species is 
associated with high fat diets, including typical Western diets with 
a high consumption of red meat that are associated with increased 
cancer risk [5]. It is possible that Bacteroides fragilis and 
Bacteroides stercosis act as keystone species in individuals 
consuming high fat diets due to their ability to convert bile into 
metabolites that are used by other members of the microbial 
community [5,9]. 

The keystone species hypothesis can be experimentally tested by 
perturbing the abundance of individual species in a microbial 
consortium and observing the effect on the composition of the 
community. Our prc^diction is that most perturbations wiU have 
Utde impact on the overall structure of the microbial community, 
but perturbations applied to a small number of keystone species 
win have a large impact on the structure of the community. Due to 
ethical concerns, it is difficult to envision a direct experimental test 
of the keystone species hypothesis in human microbiota and, 
therefore, to test our specific predictions in regards to the keystone 
species Bacteroides fragils and Bacteroides stercosis. Nevertheless, 
experimental tests could be performed in animal models, or even 
in culture if a large enough microbial consortia can be assembled. 

Materials and Methods 

Discrete Time Lotica-Volterra Model for absolute 
abundances 

Metagenomic sequencing methods have made it possible to 
follow the time evolution of a microbial population by determining 
the relative abundances of the species in a community in discrete 



intervals (e.g. one day). Given the discrete nature of these data, it is 
most sensible to use a discrete-time model of population dynamics. 
The discrete-time Lotka-Volterra (dLV) model of population 
dynamics (sometimes called the Ricker model) relates the 
abundance of species / at a time t-'rSt {Xi{t-\-&t)) to the 
abundances of all the species in the ecosystem at a time / 
(x = . . . Xjv(/)}). These interactions are encoded in the dLV 

through a set of interaction coefficients, Cy, that describe the 
influence .species j has on the abundance of .species i [19], and 
inferring these interaction coefficients is the major goal of this 
work. The abundance of a species can also change due to 
environmental and demographic stochastic effects. The dLV can 
be generalized to include stochasticity by including a log-normally 
distributed multiplicative noise, »7,(0- Specifically, the dynamics 
are modeled by the equations 

x,(/ + dt) = riiit)xi(t) exp {dt ^ Cijixjit) - <x,»), (2) 
J 

where <(Xy ) is equilibrium abundance of species j and is set by the 
carrying capacity of the environment. In writing tlicsc' equations in 
this form, we have assumed that in the absence of noise the dLV 
equations have a unique steady-state solution with the abundances 
given by (^Xj}. Notice that in the absence of multiplicative noise 
and the limit dt->0, the dLV model reduces to the usual Lotka- 
Volterra differential equations: 



In what follows, without loss of generality, we set dt=l. This is 
equivalent to measuring time in units ofdt. To fit microbial data, it 
is actually helpful to work with the logarithm of Ecjuation (2). 
Furthermore, we assume that the sampling time and the update 
time are both equal to 1 . Taking the logarithm of Eq. 2 yields, 

\nxi(t+\)- lnxi(t) = i:i(t)+J2ciAxA0-<Xj}), (4) 

j 

where by construction f ,(?) = In r]j(t) is a normally distributed 
variable. This logarithmic form of the dLV model is especially 
convenient for inferring species interactions from time series of 
species abundances because the inference problem reduces to 
standard linear regression (as discussed below). 

Thus, far we have assumed that it is possible to directly measure 
the absolute abundances However, in practice, metagenomic 
sequencing studies typically provide relative abundances 
Xi = Z~^Xi, where Z = ^^jXj. Provided that the number of 
species is large (A^» 1), the fluctuations in the total population size 
Z(;) = ^^.x,(r) around its mean value <Z> = ^^.<x,> will be 
small. In this case, the dynamics of the relative abundances (x,-(0) 
are well-described by a modified dLV model: 

Xi{t+ 1) « rji{t)xi{t) exp ( ^ Cijixjit) - <x,»), (5) 

J 

where we have defined the new interaction coefficients Cy = iZycy 
which are related to the true interaction coefficients c,y by the 
average population size. Thus, relative species abundance data can 
be modeled using the dLV model, but the interaction coefficients 
are known only up to an arbitrary multiplicative constant. 
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However, as discussed in the main text, the design matrix for 
relative species abundances is singular so simple linear regression 
fails. 

In all of the simulations discussed in the main text the 
stochasticity was set to ln)7,(f)~A/'(0,0.1). 

Linear Regression 

Suppose we are given data consisting of the absolute (or relative) 
abundances of the species in a population of species in the form 
of timeseries of length T starting from M initial conditions. We 
infer each row (Ci = {c,j}j^i) of the interaction coefficient matrix 
(C) separately. The equilibrium population <x) (or (^X)) is 
assumed to correspond to the population median. The design 
matrix is an M(T —\)xN matrix with rows Xi = 

{xf\t) — {^X\'},...,x^^\t) — '^XN)'}. The data vector has length 
M(r-l) and is given by v,- = {Inxf (1)- lnxf\0),...,lnxf' (T) 
- In xf'(r- 1)}. Note that any of the timepoints with xf\t) = Q 
were left out of the regression because the logarithm of zero is 
undefined. The least squares estimate for the interaction 
coefficients is c, =X''"v,-, where the + denotes the psuedo-inverse. 

Outline of the LIMITS Algorithm 

Here, we present a high-level oudine of the LIMITS algorithm 
(see Fig. 2). An implementation of the LIMITS algorithm written 

in Mathematica (Wolfram Research, Inc.) is available in the 
Supporting Information and on the author's webpage at http:// 
physics.bu.edu/~pankajm. Since all of the regressions are 
performed independently for each species, we will only describe 
the algorithm for inferring a row of the interaction matrix 
(c, = {Ci,}j^j). One simply loops over / to obtain the fuU 
interaction matrix. Moreover, bootstrap aggregating simply 
involves performing the whole proceedure L times, thereby 
constructing multiple estimates cf \ ■ ■ ■ ,c,'^' and taking their 
median. Thus, the only thing that takes some effort to explain is 
how to construct one of estimate of c,. 

1 . First, we randomly partition the data into a training set and a 
test set, each containing half the data points. 

2. A set of active coefficients is initialized to ACTIVE = {c,, } and 
a set of inactive coefficients is initialized to INACTIVE = 

3. A linear regression including only species i is performed on the 
training set, and the inferred coefficient is used to calculate a 
prediction error (called ERROR) for the test dataset. 

4. For each coefficient c in INACTIVE create TEST = ACTIVE 
[J{c} and perform a linear regression using the coefficients in 
TEST against the training dataset. 

5. Next, the inferred coefficients are used to calculate the 
prediction errors for the test dataset. The particular species j 
with the smallest prediction error is retained, and we call this 
error ERRORO). 

6. If 100x(ERROR-ERROR(/))/ERROR is greater than a 
pre-specified required improvement in prediction then we set 
ERROR = ERRORO), ACTIVE = ACTIVE [J {cy}, and Cy 
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Generating Test Interaction Matrices 

Interaction matrices (of size 10x10 for 10 species) were 
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equilibrium abundance of species i. Next, random Cy parameters 
were added to the matrix one at a time as long as the Lotka- 
Volterra system remained stable until 10 off diagonal interactions 
had been added. The algorithm we used for generating random 
interaction matrices is provided in Code SI. 

Supporting Information 

Code SI Code for the LIMITS algorithm and an 
example implemented in Mathematica (Wolfram Re- 
search, Inc.) Version 8.0. 

(NB) 

Figure SI Example Lotka-Volterra Time Series. Repre- 
sentative simulations of Lotka-Volterra time series with a) a 
symmetric interaction matrix and b) an asymmetric interaction 
matrix. 
(TIFF) 

Figure S2 Errors in Interaction Topology as a Function 
of Sample Size. The errors in inferred interaction topologies as a 
function of sample size for a) a symmetric interaction matrices and 
b) an asymmetric interaction matrices. The black line represents 
the true positi\-e rate, the dashed blue line represents the false 
negative rate, and the red line represents the rate of sign errors. 
(TIFF) 

Figure S3 Interaction topologies of abundant species in 
the guts of two individuals using different prediction 
error thresholds. The size of a node denotes the median 

relative species abundance, beneficial interactions are shown as 
solid red arrows, and competetive interactions are shown as 
dashed blue arrows. The 14 species included in the model were 
obtained by taking the union of the top 10 most abundant species 
from individuals a and b. 
(TIFF) 
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